Overview and Dataset Description

The analysis here is on dataset of hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis, MN and St Paul, MN. This dataset was taken from https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume . This analysis will help in better urban planning, maintenance planning, lowering congestion level and increasing driver safety by forecasting the traffic volume for a future timeframe.

Column Names: Datatype: Meanings

The available data is hourly from 2nd Oct 2012 to 30th Sep 2018. So, we have hourly data for 6 years. While we will analyse old years data for traffic patern, we will mostly use last 1-2 years data (2016 Oct to 2018 Sep) for our forecast.

##    holiday               temp          rain_1h            snow_1h         
##  Length:48204       Min.   :  0.0   Min.   :   0.000   Min.   :0.0000000  
##  Class :character   1st Qu.:272.2   1st Qu.:   0.000   1st Qu.:0.0000000  
##  Mode  :character   Median :282.4   Median :   0.000   Median :0.0000000  
##                     Mean   :281.2   Mean   :   0.334   Mean   :0.0002224  
##                     3rd Qu.:291.8   3rd Qu.:   0.000   3rd Qu.:0.0000000  
##                     Max.   :310.1   Max.   :9831.300   Max.   :0.5100000  
##    clouds_all     weather_main       weather_description  date_time        
##  Min.   :  0.00   Length:48204       Length:48204        Length:48204      
##  1st Qu.:  1.00   Class :character   Class :character    Class :character  
##  Median : 64.00   Mode  :character   Mode  :character    Mode  :character  
##  Mean   : 49.36                                                            
##  3rd Qu.: 90.00                                                            
##  Max.   :100.00                                                            
##  traffic_volume  record_date          date_time.1                 
##  Min.   :   0   Min.   :2012-10-02   Min.   :2012-10-02 09:00:00  
##  1st Qu.:1193   1st Qu.:2014-02-06   1st Qu.:2014-02-06 11:45:00  
##  Median :3380   Median :2016-06-11   Median :2016-06-11 03:30:00  
##  Mean   :3260   Mean   :2016-01-04   Mean   :2016-01-05 10:06:31  
##  3rd Qu.:4933   3rd Qu.:2017-08-11   3rd Qu.:2017-08-11 06:00:00  
##  Max.   :7280   Max.   :2018-09-30   Max.   :2018-09-30 23:00:00  
##   week_name          month_name       
##  Length:48204       Length:48204      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

From the above plot, we can see that there is no missing value in existing records. But since data volume is huge, let’s analyse if obervation for any date or month is missing.

## [1] "Number of observation for Oct, 2012 Oct to Sep, 2013 Sep is : 9219"
## [1] "Number of observation for Oct, 2013 Oct to Sep, 2014 Sep is : 6752"
## [1] "Number of observation for Oct, 2014 Oct to Sep, 2015 Sep is : 2731"
## [1] "Number of observation for Oct, 2015 Oct to Sep, 2016 Sep is : 8307"
## [1] "Number of observation for Oct, 2016 Oct to Sep, 2017 Sep is : 10593"
## [1] "Number of observation for Oct, 2017 Oct to Sep, 2018 Sep is : 10602"

We know that for an year there can be 8760 records (or 8784 for 2016 being a leap year). By looking at data for 2014, 2015, it seems a lot of observations are missing. 2013 being very old compared to latest data, we decided to ignore this year’s data as well.

So, we will go ahead with data for 2016, 2017 and 2018 to analyse our use case.

## [1] "Summary of data with non zero snow variable is below:"
##    holiday               temp          rain_1h           snow_1h      
##  Length:63          Min.   :266.7   Min.   :0.00000   Min.   :0.0500  
##  Class :character   1st Qu.:268.9   1st Qu.:0.00000   1st Qu.:0.0600  
##  Mode  :character   Median :273.2   Median :0.00000   Median :0.1000  
##                     Mean   :271.3   Mean   :0.06222   Mean   :0.1702  
##                     3rd Qu.:273.7   3rd Qu.:0.00000   3rd Qu.:0.2500  
##                     Max.   :274.3   Max.   :0.98000   Max.   :0.5100  
##    clouds_all    weather_main       weather_description  date_time        
##  Min.   :75.00   Length:63          Length:63           Length:63         
##  1st Qu.:90.00   Class :character   Class :character    Class :character  
##  Median :90.00   Mode  :character   Mode  :character    Mode  :character  
##  Mean   :88.57                                                            
##  3rd Qu.:90.00                                                            
##  Max.   :90.00                                                            
##  traffic_volume  record_date          date_time.1                 
##  Min.   : 359   Min.   :2015-12-23   Min.   :2015-12-23 12:00:00  
##  1st Qu.: 888   1st Qu.:2015-12-28   1st Qu.:2015-12-28 21:00:00  
##  Median :2979   Median :2015-12-29   Median :2015-12-29 07:00:00  
##  Mean   :2939   Mean   :2015-12-31   Mean   :2015-12-31 14:47:37  
##  3rd Qu.:4933   3rd Qu.:2016-01-07   3rd Qu.:2016-01-07 16:00:00  
##  Max.   :6438   Max.   :2016-01-08   Max.   :2016-01-08 15:00:00  
##   week_name          month_name       
##  Length:63          Length:63         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Only 63 records has snow value greater than 0. This looks like data collection error because snow for only 63 days in Minneapolis state (over 6 years) does not look real. So, we can remove this variable from analysis to avoid error prone extrapolating.

Also, we saw that 2017 and 2018 had more than 8760 records. By analysing data, we found some duplicate records. Hence, we will create final dataframe after removing those duplicates.

c. Stationary / Non-Stationary

We see from the Spectral densities plot that there exist seasonal trend in the data. Also, the ACF’s are damping exponentially.

We see seasonality of 24, 12, 7 in the data. So, data is not independent of time. Daily mean, weekly mean can vary over time based on month of the year. Hence, this model is non-stationary.

However, since the realization is really clumsy to check for any seasonality in the data, let’s check by picking less records from different year’s dataset.

d. ACFs and Spectral Densities just to explore

## [1] "Realization of hourly data in chunks of 20 days are below for different years"

From the above plots, we can see that values are almost getting repeated every 24 hours. Also, we see that after every 5 days, traffic is little less for 2 days. This is another trend. Based on our domain knowledge when we know people travel more during weekdays for work (mostly 1 person per vehicle where as weekend trips are mostly in groups)

e. At least 2 candidate ARMA / ARIMA models

aic5.wge(traffic_volume_final$traffic_volume, type = "aic")
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 17    5    1   13.04232
## 11    3    1   13.04902
## 9     2    2   13.05508
## 16    5    0   13.06028
## 13    4    0   13.06812
aic5.wge(traffic_volume_final$traffic_volume, type = "bic")
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  bic
##       p    q        bic
## 17    5    1   13.04429
## 11    3    1   13.05042
## 9     2    2   13.05649
## 16    5    0   13.06197
## 8     2    1   13.06944

AIC estimate picked ARMA(5,1) as best model. Here we are estimating next 72 observation which is forecast for last 3 days.

ARMA model

## 
## Coefficients of Original polynomial:  
## 2.1042 -1.4630 0.2731 0.1211 -0.0808 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.7940B+0.8479B^2    1.0580+-0.2452i      0.9208       0.0362
## 1-0.6503B+0.2800B^2    1.1615+-1.4909i      0.5291       0.1447
## 1+0.3402B             -2.9395               0.3402       0.5000
##   
## 

## [1] "ASE for above ARMA(5,1) model is : 2475821.2232016"

Seasonal model

## 
## Coefficients of Original polynomial:  
## 1.2799 -0.3668 -0.0670 0.0207 0.0479 -0.0816 -0.0188 0.0296 0.0484 -0.0397 -0.0644 -0.0337 0.0615 0.0446 -0.0257 -0.0462 -0.0005 0.0042 -0.0045 -0.0140 0.0074 0.0169 0.0437 0.0767 -0.0874 -0.0076 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.8808B+0.9414B^2    0.9990+-0.2535i      0.9702       0.0396
## 1-1.3638B+0.9163B^2    0.7442+-0.7331i      0.9572       0.1238
## 1-1.6740B+0.9123B^2    0.9175+-0.5043i      0.9551       0.0800
## 1-0.5050B+0.9069B^2    0.2784+-1.0125i      0.9523       0.2073
## 1-0.9260B+0.8272B^2    0.5598+-0.9464i      0.9095       0.1650
## 1+0.4486B+0.8063B^2   -0.2782+-1.0783i      0.8980       0.2902
## 1+0.8944B+0.7956B^2   -0.5621+-0.9700i      0.8920       0.3336
## 1+1.2568B+0.7922B^2   -0.7932+-0.7957i      0.8901       0.3748
## 1+1.5369B+0.7872B^2   -0.9762+-0.5633i      0.8872       0.4167
## 1-0.0119B+0.7844B^2    0.0076+-1.1291i      0.8857       0.2489
## 1+1.7068B+0.7808B^2   -1.0929+-0.2936i      0.8836       0.4582
## 1+0.8734B             -1.1449               0.8734       0.5000
## 1-1.7163B+0.7391B^2    1.1611+-0.0700i      0.8597       0.0096
## 1+0.0811B             -12.3326               0.0811       0.5000
##   
## 
## 
## Coefficients of Original polynomial:  
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.4142B+1.0000B^2    0.7071+-0.7071i      1.0000       0.1250
## 1-0.5176B+1.0000B^2    0.2588+-0.9659i      1.0000       0.2083
## 1-1.9319B+1.0000B^2    0.9659+-0.2588i      1.0000       0.0417
## 1+1.0000B+1.0000B^2   -0.5000+-0.8660i      1.0000       0.3333
## 1-1.0000B+1.0000B^2    0.5000+-0.8660i      1.0000       0.1667
## 1-1.7321B+1.0000B^2    0.8660+-0.5000i      1.0000       0.0833
## 1-1.0000B              1.0000               1.0000       0.0000
## 1+0.5176B+1.0000B^2   -0.2588+-0.9659i      1.0000       0.2917
## 1+1.4142B+1.0000B^2   -0.7071+-0.7071i      1.0000       0.3750
## 1+0.0000B+1.0000B^2    0.0000+-1.0000i      1.0000       0.2500
## 1+1.7321B+1.0000B^2   -0.8660+-0.5000i      1.0000       0.4167
## 1+1.9319B+1.0000B^2   -0.9659+-0.2588i      1.0000       0.4583
## 1+1.0000B             -1.0000               1.0000       0.5000
##   
## 

## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 17    5    1   13.50918
## 11    3    1   13.51044
## 16    5    0   13.54437
## 9     2    2   13.54512
## 8     2    1   13.54600
## 
## Coefficients of Original polynomial:  
## 2.1939 -1.4928 0.2302 0.0858 -0.0273 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.8039B+0.8207B^2    1.0990+-0.1032i      0.9059       0.0149
## 1-0.6432B+0.1314B^2    2.4472+-1.2733i      0.3625       0.0764
## 1+0.2531B             -3.9503               0.2531       0.5000
##   
## 

## [1] "ASE for above seasonal model is : 2888406.40003984"

Multivariate time series analysis

## 
## Call:
## arima(x = traffic_volume_final$traffic_volume, order = c(phi$p, 0, phi$q), xreg = traffic_volume_final[, 
##     c(2:4)])
## 
## Coefficients:
##          ar1      ar2     ar3     ar4      ar5      ma1  intercept    temp
##       2.1062  -1.4667  0.2751  0.1207  -0.0803  -0.7883   761.8147  8.8332
## s.e.  0.0117   0.0191  0.0168  0.0138   0.0062   0.0103   434.9775  1.5402
##       rain_1h  clouds_all
##        0.0302      0.0600
## s.e.   0.0403      0.1419
## 
## sigma^2 estimated as 460881:  log likelihood = -234102,  aic = 468226

## Obs -0.001323695 0.001176312 0.0008094952 -0.02729275 0.03655779
## [1] 1.405653e-12
## Obs -0.001323695 0.001176312 0.0008094952 -0.02729275 0.03655779 -0.01968347 -0.04596793 -0.002236276 0.08202109 0.07318011 -0.01350892 -0.08307697 -0.02460241 0.06639319 0.06907025 -0.001318474 -0.0373869 -0.02511873 -0.007100658 -0.03238815 -0.04974949 -0.02745047 0.052335 0.165472 0.09035572 0.01027199 -0.004995573 -0.003888772 0.01449666 -0.002955487 -0.02841959 0.002375802 0.02396055 0.01382091 -0.01119042 -0.04524336 -0.02439875 0.008439027 0.02538246 0.004957913 -0.01954381 -0.01273011 0.002663449 -0.01552083 -0.008602525 -0.009270969 0.0323756 0.07157867
## [1] 0
## 
## Call:
## arima(x = traffic_volume_final$traffic_volume, order = c(phi$p, 0, phi$q), xreg = cbind(t, 
##     traffic_volume_final[, c(2:4)]))
## 
## Coefficients:
##          ar1      ar2     ar3     ar4      ar5      ma1  intercept       t
##       2.1070  -1.4675  0.2752  0.1207  -0.0802  -0.7892   861.7863  0.0032
## s.e.  0.0116   0.0191  0.0168  0.0138   0.0062   0.0103   433.4392  0.0022
##         temp  rain_1h  clouds_all
##       8.3144   0.0289      0.0505
## s.e.  1.5455   0.0403      0.1419
## 
## sigma^2 estimated as 460846:  log likelihood = -234100.9,  aic = 468225.8

## Obs -0.001366506 0.001003082 0.0006651114 -0.0272827 0.03659667
## [1] 1.369571e-12
## Obs -0.001366506 0.001003082 0.0006651114 -0.0272827 0.03659667 -0.01968326 -0.04602649 -0.002273615 0.08203495 0.07323695 -0.01345809 -0.08304797 -0.02453372 0.06649165 0.06914978 -0.001277692 -0.03737148 -0.0250877 -0.007073392 -0.03239088 -0.04981278 -0.0275641 0.05222153 0.1653656 0.09022845 0.01012151 -0.005152687 -0.003999976 0.01439389 -0.003057248 -0.02854696 0.002249962 0.02385174 0.01373148 -0.01127567 -0.04533555 -0.02446055 0.008376651 0.02533684 0.004902983 -0.01960978 -0.01278146 0.002619512 -0.01556889 -0.008672358 -0.009367844 0.03227591 0.07146556
## [1] 0
## 
## Call:
## arima(x = traffic_volume_final$traffic_volume, order = c(phi$p, 0, phi$q), xreg = cbind(traffic_volume_final[, 
##     2]))
## 
## Coefficients:
##          ar1      ar2     ar3     ar4      ar5      ma1  intercept
##       2.1060  -1.4664  0.2750  0.1208  -0.0804  -0.7881   905.4582
## s.e.  0.0117   0.0191  0.0168  0.0138   0.0062   0.0103   433.1713
##       cbind(traffic_volume_final[, 2])
##                                 8.3345
## s.e.                            1.5347
## 
## sigma^2 estimated as 460890:  log likelihood = -234102.3,  aic = 468222.6

## Obs -0.001299314 0.001212038 0.0008512073 -0.02734207 0.03652034
## [1] 1.404654e-12
## Obs -0.001299314 0.001212038 0.0008512073 -0.02734207 0.03652034 -0.01974294 -0.04597659 -0.002256005 0.08198994 0.07315742 -0.01347671 -0.08308763 -0.02462663 0.06637738 0.06898944 -0.001311623 -0.03742588 -0.02517311 -0.007201507 -0.03243947 -0.04980159 -0.027534 0.05229091 0.1653855 0.09026238 0.01033078 -0.0051034 -0.003889892 0.01443744 -0.002996019 -0.02844975 0.002346801 0.02389577 0.01381294 -0.01123498 -0.04526102 -0.02440017 0.008343268 0.02539306 0.004958378 -0.01953524 -0.01273768 0.002605641 -0.01549533 -0.008656762 -0.009268199 0.03235983 0.07151161
## [1] 0

From Ljung Box Test and from ACF of residuals (for all combinations of variables we tried), it is clear that the residuals are not white noise. Hence we decided not to use this model.

So, now let’s analyse NN model.

## [1] "ASE for above NN model is : 67122186.7034752"

Let’s analyse VAR model using the small dataset we created for NN model.

## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      5      4      2      5 
## 
## $criteria
##                 1         2         3         4         5         6         7
## AIC(n)   4.799266  4.103615  4.036316  4.001996  3.981915  3.989432  4.013313
## HQ(n)    5.066322  4.400344  4.362718  4.358071  4.367663  4.404853  4.458407
## SC(n)    5.479957  4.859938  4.868271  4.909583  4.965135  5.048284  5.147798
## FPE(n) 121.459128 60.584724 56.649635 54.747734 53.670114 54.087769 55.410058
##                8         9        10
## AIC(n)  4.027993  4.044769  4.015164
## HQ(n)   4.502760  4.549209  4.549276
## SC(n)   5.238110  5.330518  5.376545
## FPE(n) 56.246924 57.218630 55.571506

Finally since we have seen weekdays are mostly contributing to traffic volume, for simplicity, we have decided to use the data set that belongs to weekdays. Because we have seen that data our model with seasonality factor of 24 is the best model among others.

So, even though our model is not super helpful for weeknds but it is definitely going to be a huge help for weekdays forecast.

## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 11    3    1   13.68037
## 9     2    2   13.71840
## 16    5    0   13.72819
## 8     2    1   13.72971
## 10    3    0   13.73005
## 
## Coefficients of Original polynomial:  
## 0.6343 0.3927 -0.2151 -0.0382 -0.0060 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.3631B+0.4884B^2    1.3955+-0.3164i      0.6989       0.0355
## 1+0.5693B             -1.7566               0.5693       0.5000
## 1+0.1596B+0.0215B^2   -3.7068+-5.7198i      0.1467       0.3415
##   
## 

## [1] "ASE for above seasonal model is : 4727624.14434723"

## [1] "ASE for above seasonal model is : 8002835.51148372"